Abstract

Background: Measuring similarity between diseases plays an important role in disease-related molecular function research.
Functional associations between disease-related genes and semantic associations between diseases are often used to 
identify pairs of similar diseases from different perspectives. Currently, it is still a challenge to exploit both of them to 
calculate disease similarity. Therefore, a new method (SemFunSim) that integrates semantic and functional association is 
proposed to address the issue. 

Methods: SemFunSim is designed as follows. First of all, FunSim (Functional similarity) is proposed to calculate disease 
similarity using disease-related gene sets in a weighted network of human gene function. Next, SemSim (Semantic 
Similarity) is devised to calculate disease similarity using the relationship between two diseases from Disease Ontology. 
Finally, FunSim and SemSim are integrated to measure disease similarity. 

Results: The high average AUC (area under the receiver operating characteristic curve) (96.37%) shows that SemFunSim
achieves a high true positive rate and a low false positive rate. 79 of the top 100 pairs of similar diseases identified by
SemFunSim are annotated in the Comparative Toxicogenomics Database (CTD) as being targeted by the same therapeutic
compounds, while the other methods we compared identified 35 or fewer such pairs among the top 100. Moreover, when
using our method on diseases without annotated compounds in CTD, we could confirm many of our predicted candidate 
compounds from literature. This indicates that SemFunSim is an effective method for drug repositioning. 
Background

The quantitative measurement of similarity between diseases
based on qualitative association [1-5] has attracted increasing
attention, because it plays an important role in predicting
disease-causing genes [6,7], inferring microRNA function associations [8],
and identifying novel drug indications [9]. Currently, there is a
critical need to design methods to measure disease similarity.

Methods for calculating disease similarity can be broadly 
classified as semantic-based [8,10] and function-based [11-13]. 
Semantic-based methods are widely used for measuring similarity 
between terms of Gene Ontology (GO) [14,15] and human 
phenotype ontology (HPO) [16] in the biomedical and bioinfor- 
matics domain. Few of them are used for calculating similarity 
between terms of disease-related ontologies. For computing the 
similarity of GO terms, Resnik's method [17] has a better 
performance evaluation result [18] than union-intersection (UI), 
longest shared path (LP), JC [19] and Lin [20]. Resnik's method 
has also been used to calculate the similarity between terms of 



Disease Ontology (DO) [10,21], measuring disease similarity based
on the information content (IC) (Figure S1 and File S1) of the most
informative common ancestor (MICA) (Figure S1 and File S1)
between two terms. In addition, Wang et al.'s method [22] 
calculates similarity between terms considering multiple common 
ancestors. It performs very well for computing the semantic 
similarity between GO terms [22], and has been successfully used 
for measuring disease similarity between medical subject headings 
(MeSH) [23] terms and inferring microRNA function network [8]. 

Function-based methods calculate disease similarity by comparing
disease-related gene sets [11-13]. Mathur and Dinakarpandian [11]
designed a similarity method based on overlapping
gene sets (BOG) between diseases of DO. In comparison to
semantic-based methods, the BOG method defines disease 
similarity from a new perspective. Therefore, it is possible to find 
unknown relationships [11]. However, it ignores the functional 
associations between disease-related genes which contribute to 
disease similarity. In another method, Mathur et al. [13] presented 



PLOS ONE | www.plosone.org



1 



June 2014 | Volume 9 | Issue 6 | e99415 



A New Method for Measuring Disease Similarity 



a process-similarity based (PSB) method by involving the 
associations based on GO [14] terms. PSB outshines BOG, and 
its performance is better than Resnik [17], Lin [20], LC [24] and 
JC's [19] methods [13]. Functional associations between genes 
involve multiple aspects, such as co-expression [25], protein-
protein interaction [26], GO terms [27], etc. However, the PSB
method only exploits the associations from GO terms. Therefore, 
the performance would likely be better if multiple associations 
were considered for calculating disease similarity. 

There are many disease-related vocabularies, some of which 
describe semantic associations between diseases by 'IS_A' 
relationship (Figure 1), such as MeSH, DO, etc. Among them, 
DO is an ontology that organizes vocabularies around diseases
themselves [21], and it integrates disease and medical vocabularies
through extensive cross mapping [21]. Other vocabularies
often include not only diseases themselves, but also terms of
pathology, anatomy, etc. For example, MeSH is a more
comprehensive ontology that is classified into 16 categories.
Of these categories, only categories C and F03 define terms around
disease. However, not all the terms in these categories are named
for diseases themselves, such as pain (D010146). Furthermore, DO
has been validated to be suitable for calculating disease similarity
[11,13,28]. Therefore, we chose DO as the disease terminology to
describe disease terms for calculating disease similarity.

Function-based methods calculate disease similarity according
to functional associations between genes. Semantic-based methods
exploit associations from ontologies and the number of
disease-related genes to compute disease similarity. Clearly, not all
associations between diseases are represented by the ontology; some
of them are reflected only through functional associations among
disease-related genes, and vice versa. In this paper, a new method
(SemFunSim) is proposed, which integrates semantic and gene
functional association for measuring similarity between diseases.

Materials and Methods 

Disease Ontology 

DO [21] (Table 1) contains 8,632 disease terms and 7,232
'IS_A' relationships among diseases. The directed acyclic graph
(DAG) of DO represents terms linked by 'IS_A' relationships, in
which a node represents a DO term and an edge represents an
'IS_A' relationship between diseases. Figure 1 shows a sub-graph
of the DAG starting from the specific DO term 'Cutaneous lupus
erythematosus (DOID:0050169)' and ending at the root term of
DO.
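
As a rough illustration of how 'IS_A' links can be traversed, the sketch below collects the ancestors of a term in a small DAG and intersects them for two terms. The term names (`leaf_x`, `mid_a`, and so on), the `parents` mapping and the function name are hypothetical placeholders chosen for this example; they are not DO content and not the authors' implementation.

```python
def ancestors(term, parents):
    """Return the term itself plus every term reachable through IS_A edges.

    parents maps each term to the list of its direct IS_A parents.
    The term is included in its own ancestor set so that a pair in which
    one term is an ancestor of the other still shares that term.
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return seen


# Hypothetical toy DAG: two leaves, two intermediate terms, one root.
parents = {
    "leaf_x": ["mid_a", "mid_b"],
    "leaf_y": ["mid_b"],
    "mid_a": ["root"],
    "mid_b": ["root"],
    "root": [],
}

# Common ancestors of the two leaves; the MICA is chosen among these.
common = ancestors("leaf_x", parents) & ancestors("leaf_y", parents)
print(sorted(common))
```

Because a term may have several parents, the traversal keeps a `seen` set so that shared ancestors are visited only once.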



names in these sources have been converted to HUGO Gene 
Nomenclature Committee (HGNC) approved gene symbols [36]. 

Disease similarity 

Figure 2 gives an overview of SemFunSim. In the figure, $d_1$ and
$d_2$ are two diseases from DO, and $d_{MICA}$ is the MICA of $d_1$ and
$d_2$. $G_1$, $G_2$ and $G_{MICA}$ are gene sets related to $d_1$, $d_2$ and $d_{MICA}$,
respectively. First, a weighted network of human gene function
association is used for calculating FunSim (functional similarity)
between $G_1$ and $G_2$. Then, semantic associations from DO are
used to calculate semantic similarity (SemSim) between diseases.
Finally, FunSim and SemSim are integrated into SemFunSim.

Functional similarity between disease-related gene 
sets. Gene function networks are widely used to understand 
disease [29,37-43]. We accessed the interactions of genes from 
HumanNet [29], which has been used to understand associations 
across three GO categories [44]. Each interaction of HumanNet
has an associated log likelihood score (LLS) that measures the 
probability of a functional linkage between genes [29]. We 
normalized the associated LLS with equation 1. 

\[ LLS_N(g_i, g_j) = \frac{LLS(g_i, g_j) - LLS_{min}}{LLS_{max} - LLS_{min}} \qquad (1) \]

where $g_i$ and $g_j$ indicate the $i$th and $j$th gene, respectively.
$LLS_N(g_i, g_j)$ represents the LLS between $g_i$ and $g_j$ after normalization.
$LLS(g_i, g_j)$ represents the LLS between $g_i$ and $g_j$. $LLS_{min}$ and
$LLS_{max}$ are the minimum LLS and the maximum LLS of
HumanNet, respectively.
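
The normalization step can be sketched as follows, assuming it is the usual min-max rescaling implied by the definitions of the minimum and maximum LLS. The dictionary keyed by unordered gene pairs, the gene names and the raw scores are assumptions made for this example; they do not reflect the HumanNet file format or its values.

```python
def normalize_lls(edges):
    """Rescale raw LLS weights to the range [0, 1] by min-max normalization.

    edges maps frozenset({gene_a, gene_b}) to the raw LLS of that interaction.
    """
    lls_min = min(edges.values())
    lls_max = max(edges.values())
    span = lls_max - lls_min
    if span == 0:
        raise ValueError("all LLS values are equal; normalization is undefined")
    return {pair: (lls - lls_min) / span for pair, lls in edges.items()}


# Hypothetical raw scores for three interactions.
toy_edges = {
    frozenset({"A", "B"}): 1.0,
    frozenset({"B", "C"}): 2.5,
    frozenset({"A", "C"}): 4.0,
}
normalized = normalize_lls(toy_edges)
print(normalized[frozenset({"B", "C"})])  # 0.5
```

One consequence of this rescaling is that the weakest interaction in the network receives a normalized score of 0, the same value that is later assigned to gene pairs with no interaction at all.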

The functional similarity score between a pair of genes is 
defined as FunSim(gi,gj): 

\[
FunSim(g_i, g_j) =
\begin{cases}
1 & i = j \\
LLS_N(g_i, g_j) & i \neq j \text{ and } e(i,j) \in E(HumanNet) \\
0 & i \neq j \text{ and } e(i,j) \notin E(HumanNet)
\end{cases}
\qquad (2)
\]

In equation 2, $e(i,j)$ represents the interaction edge between
gene pair $g_i$ and $g_j$. $E(HumanNet)$ is the set which includes all the
edges of HumanNet.
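
The three cases of the gene-pair score (identical genes, linked genes, unlinked genes) map directly onto a dictionary lookup with a default. This is a minimal sketch; the function name, the edge dictionary and its normalized scores are hypothetical.

```python
def gene_funsim(g_i, g_j, normalized_edges):
    """Functional similarity of two genes.

    Returns 1 for identical genes, the normalized LLS when the pair is an
    edge of the network, and 0 when the pair is not connected.
    """
    if g_i == g_j:
        return 1.0
    return normalized_edges.get(frozenset({g_i, g_j}), 0.0)


# Hypothetical normalized scores for two interactions.
normalized_edges = {
    frozenset({"A", "B"}): 0.5,
    frozenset({"B", "C"}): 0.2,
}

print(gene_funsim("A", "A", normalized_edges))  # 1.0
print(gene_funsim("B", "A", normalized_edges))  # 0.5
print(gene_funsim("A", "C", normalized_edges))  # 0.0
```

Keying the dictionary by `frozenset` makes the lookup independent of argument order, which matches the undirected nature of the interactions.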

Then, we define the functional association between a gene $g$
and a gene set $G = \{g_1, g_2, \ldots, g_k\}$ as $F_G(g)$, which is described in
equation 3.



HumanNet and disease-related gene set 

We accessed functional interactions of genes from HumanNet 
[29], which is an extended gene functional interaction network for 
Homo sapiens. Multiple distinct lines of evidence, spanning 
human mRNA co-expression, protein-protein interaction, protein 
complex, and comparative genomics data sets, in combination 
with similar lines of evidence from orthologs in yeast, fly and worm 
are comprehensively analyzed for the network using a probabilistic 
method [29]. This function network contains 476,399 interactions 
among 16,243 genes (Table 1). 

Disease-related gene sets are from SIDD [30], which integrates
five disease-related gene databases: GeneRIF [31], Online
Mendelian Inheritance in Man (OMIM) [32], comparative
toxicogenomics database (CTD) [33], genetic association database
(GAD) [34], and SpliceDisease [35]. In total, 2,817 diseases,
12,063 genes and 117,190 associations between them are involved
(Dataset S1). The data sources were downloaded from the web in
Jul 2013, and the detailed information is listed in Table 1. Gene



\[ F_G(g) = \max_{1 \le i \le k} FunSim(g, g_i), \quad g_i \in G \qquad (3) \]

where $k$ indicates the number of genes in $G$, and $g_i$ is the $i$th gene of $G$.

Let a pair of gene sets $G_1 = \{g_{11}, g_{12}, \ldots, g_{1m}\}$ and
$G_2 = \{g_{21}, g_{22}, \ldots, g_{2n}\}$ be related to diseases $d_1$ and $d_2$, respectively,
where $m$ is the number of genes in $G_1$, and $n$ is the number of genes in
$G_2$. We define FunSim of $d_1$ and $d_2$ in equation 4 as follows.

\[ FunSim(G_1, G_2) = \frac{\sum_{1 \le i \le m} F_{G_2}(g_{1i}) + \sum_{1 \le j \le n} F_{G_1}(g_{2j})}{m + n}, \quad g_{1i} \in G_1,\; g_{2j} \in G_2 \qquad (4) \]
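
The gene-to-set and set-to-set scores can be sketched as below, assuming each gene is matched to its best-scoring partner in the other set and the matches are averaged over both sets. All names and the toy network are hypothetical, and the helper `gene_funsim` is repeated so that the block runs on its own.

```python
def gene_funsim(g_i, g_j, edges):
    """Gene-pair score: 1 if identical, edge weight if linked, else 0."""
    if g_i == g_j:
        return 1.0
    return edges.get(frozenset({g_i, g_j}), 0.0)


def gene_to_set(gene, gene_set, edges):
    """Best functional match of one gene against a gene set."""
    return max(gene_funsim(gene, other, edges) for other in gene_set)


def funsim(set_1, set_2, edges):
    """Average best-match score, taken in both directions."""
    if not set_1 or not set_2:
        raise ValueError("both gene sets must be non-empty")
    total = sum(gene_to_set(g, set_2, edges) for g in set_1)
    total += sum(gene_to_set(g, set_1, edges) for g in set_2)
    return total / (len(set_1) + len(set_2))


# Hypothetical normalized network and two disease gene sets sharing gene B.
edges = {frozenset({"A", "B"}): 0.5, frozenset({"B", "C"}): 0.2}
g1 = {"A", "B"}
g2 = {"B", "C"}

# Best matches: A -> 0.5, B -> 1.0 (shared gene), B -> 1.0, C -> 0.2.
score = funsim(g1, g2, edges)
print(round(score, 3))  # 0.675
```

Unlike a pure overlap measure, gene A still contributes 0.5 here because it is functionally linked to a gene in the other set, even though it is not shared.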

Semantic similarity based on Disease Ontology. We 

define semantic similarity between disease pair d\ and d2 in 
equation 5. 
[Figure 1 image. DAG nodes: Disease (DOID:4); Disease of anatomical entity (DOID:7); Immune system disease (DOID:2914); Hypersensitivity reaction disease (DOID:0060056); Hypersensitivity reaction type II disease (DOID:417); Lupus erythematosus (DOID:8857); Integumentary system disease (DOID:16); Skin disease (DOID:37); Cutaneous lupus erythematosus (DOID:0050169).]

Figure 1. A sub-graph of the DAG for DO term 'Cutaneous lupus erythematosus (DOID:0050169)'. The arrow symbol represents an 'IS_A'
link of DO. For example, 'Cutaneous lupus erythematosus (DOID:0050169)' is linked to 'Skin disease (DOID:37)' by an 'IS_A' relationship.
doi:10.1371/journal.pone.0099415.g001



A threshold for significant similarity of the 916 diseases with
potential therapeutic chemicals (PTCs) in CTD was defined based
on randomized data as follows. First, the 916 disease names in the
DAG of DO were randomly shuffled, while the hierarchical
structure remained the same as the original DO. Next, gene
names in HumanNet were randomly shuffled, while the network
topology remained the same as the original HumanNet. Then, the
similarity scores for pairs of these 916 diseases were computed by
SemFunSim based on the randomized data. The experiment was
iterated 1000 times. Finally, we calculated the false discovery rate
(FDR) over all pairs according to equation 7.





Data source | Web site (date of download)
DO | https://diseaseontology.svn.sourceforge.net/svnroot/diseaseontology/trunk/ (Apr 2013)
SIDD | http://mlg.hit.edu.cn/SIDD (Jul 2013)
CTD | http://ctdbase.org/downloads/;jsessionid=71BC29A1A48AD67BADA2E2C4FC9625F3 (Apr 2013)
HumanNet | http://www.functionalnet.org/humannet/download.html (Jul 2013)
GO | http://www.geneontology.org/GO.downloads.ontology.shtml (Jul 2013)
GOA | http://www.geneontology.org/GO.downloads.annotations.shtml (Jul 2013)
MimMiner | http://www.cmbi.ru.nl/MimMiner/suppl.html (Feb 2014)
doi:10.1371/journal.pone.0099415.t001
\[ SemSim(d_1, d_2) = \frac{|G_1| \cdot |G_2|}{|G_{MICA}|^2} \qquad (5) \]

where $G_1$ and $G_2$ are gene sets related to $d_1$ and $d_2$, respectively.
$G_{MICA}$ is the gene set related to $d_{MICA}$, which represents the MICA of
$d_1$ and $d_2$ in the DAG of DO. $|G_1|$, $|G_2|$, and $|G_{MICA}|$ represent the
number of genes in $G_1$, $G_2$ and $G_{MICA}$, respectively.

Similarity between disease pair by SemFunSim. The
similarity between disease pair $d_1$ and $d_2$ is defined in equation 6.

\[ Sim(d_1, d_2) = FunSim(G_1, G_2) \cdot SemSim(d_1, d_2) \qquad (6) \]

where $d_1$ and $d_2$ are two diseases of DO. $G_1$ and $G_2$ are gene sets
related to $d_1$ and $d_2$, respectively.

Table 1. Data sources used for measuring disease similarity. 
Figure 2. Overview of SemFunSim. $d_1$ and $d_2$ are two diseases, and $d_{MICA}$ is the MICA of $d_1$ and $d_2$. $G_1$, $G_2$ and $G_{MICA}$ represent gene sets related to $d_1$,
$d_2$ and $d_{MICA}$, respectively.

doi:10.1371/journal.pone.0099415.g002



\[ FDR(Sim_r) = \frac{\frac{1}{1000} \sum_{j=1}^{1000} N_j}{N_r} \qquad (7) \]

where $Sim_r$ represents a similarity score, $N_j$ indicates the number
of hits in the $j$th permutation with a similarity score $\ge Sim_r$,
and $N_r$ is the number of hits in the real case with a similarity
score $\ge Sim_r$.

Results and Discussion 

Validation of disease similarity methods on benchmark 
set 

We calculated similarities of disease pairs on a benchmark set
and another 100 random sets. The performance of SemFunSim
was assessed by drawing a receiver operating characteristic (ROC)



[45] curve. In Figure 3A, two types of disease pair sets are
introduced as input in the validation process. On one hand, two
manually checked datasets [12,13,46] of disease pairs with high
similarity were integrated into a benchmark set. One dataset was
obtained from diseases analyzed in the study by Suthram et al.
[12]. Disease pairs of the dataset were marked as similar after
validation from literature by Mathur et al. [13]. The other dataset
was derived from the judgment of medical residents for semantic
similarity, and pairs of similar diseases were extracted by
Pakhomov et al. [46]. In total, 47 diseases and 70 pairs of these
two disease pair datasets were merged as the benchmark set
(Dataset S2). On the other hand, each random set contains 700
disease pairs randomly selected from DO.

In order to further test the performance of the proposed 
method, SemFunSim was compared with disease similarity 
methods including Resnik [17], Wang [22], BOG [11], and PSB 
[Figure 3 image. Legend: d, disease; PTCs, potential therapeutic chemicals; Sim, similarity between diseases.]

Figure 3. The process of validation. A. The similarities of disease pairs from the benchmark set and 100 random sets were calculated by
SemFunSim, FunSim, Resnik, Wang, BOG, and PSB. B. The similarities of all the disease pairs between 916 diseases with PTCs in CTD were measured by
SemFunSim, FunSim, Resnik, Wang, BOG, and PSB. In addition, the similarities of all the disease pairs between these 916 diseases with PTCs and 44
diseases without PTCs in CTD were computed by SemFunSim.
doi:10.1371/journal.pone.0099415.g003



[13]. During the experiment, the parameters of these methods were
set according to the original papers.

Similarities of disease pairs of the benchmark set and a random
set were calculated by SemFunSim. We examined whether the
disease pairs of the benchmark set were ranked near the top, and
used this ranking to produce an ROC curve. In Figure 4A, the area under
the ROC curve (AUC) of each method is as follows: Resnik
(63.14%), Wang (68.04%), BOG (78.10%), PSB (89.52%), and
SemFunSim (96.36%). FunSim is part of SemFunSim, and has an
AUC of 94.37%. The AUC shows that Wang et al.'s method is a
little better than Resnik's method. The BOG method has the worst
performance among function-based methods. When genes are linked
based on the GO biological process category [14] by the PSB
method, the result improves significantly. Although the
PSB method shows a very high AUC, FunSim still improves on the
results of the PSB method by about 5 percentage points. After integrating gene
functional and semantic association, the SemFunSim method
improves the performance further, to 96.36%. This experiment
was iterated 100 times by calculating similarities of 100
random sets and the benchmark set. In Figure 4B, the average
AUC of the 100 permutations is 0.6345, 0.6784, 0.7657, 0.8984,
0.9415, and 0.9637 for Resnik, Wang, BOG, PSB, FunSim, and
SemFunSim, respectively. The result is consistent with Figure 4A.
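
For readers who want to reproduce this kind of evaluation, the AUC can be computed directly from scores as the probability that a benchmark pair outranks a random pair. The sketch below assumes benchmark pairs are treated as positives and pairs from a random set as negatives; the function name and the toy scores are hypothetical.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    scores higher; ties count as half a win.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical similarity scores.
benchmark = [0.9, 0.6, 0.4]          # pairs known to be similar
random_set = [0.5, 0.3, 0.1, 0.05]   # pairs drawn at random

# 11 of the 12 (benchmark, random) comparisons favor the benchmark pair.
value = auc(benchmark, random_set)
print(round(value, 4))  # 0.9167
```

Averaging this value over many random sets corresponds to the averaged AUC reported for the 100 random sets.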

Currently, functionally relevant gene associations can be 
defined in multiple ways (e.g. annotations for co-expression [25], 
protein-protein interaction [26], etc.). However, only one or two 
types of gene functional associations have been used to calculate 
the similarity by BOG and PSB [11,13]. FunSim was designed for 
calculating disease similarity based on a comprehensive weighted 
gene functional association network. In Figure 4, the AUC of 
Figure 4. AUC analysis of the benchmark set and random sets. A. ROC curves for the experimental results on the benchmark set and a
random set. It shows 1-specificity versus sensitivity of each method for calculating the similarities of disease pairs. B. Average of AUC for 100
permutations.

doi:10.1371/journal.pone.0099415.g004 



FunSim is higher than BOG and PSB. The results show that 
comprehensive gene functional association is suitable for calculat- 
ing disease similarity. 

Among the five methods, Resnik's method uses the IC of the
MICA to calculate similarity between diseases. A few disease pairs
of the benchmark set have only one common ancestor node;
consequently, the similarities of these diseases are zero according to
Resnik (File S1). For example, the similarity between disease pair
'diabetes mellitus (DOID:9351)' and 'Alzheimer's Disease
(DOID:10652)' is zero (File S1), because the MICA of these two
diseases is the root node of DO (Figure S1), and the IC of the root
node is zero. To avoid this problem for pairs of similar diseases
with only one common ancestor, the IC is not used for measuring
disease similarity in SemSim. The ROC curves in Figure 4A show
clearly that SemFunSim has the highest AUC, which indicates that
the integrated semantic association helps to enhance the true
positive rate and reduce the false positive rate.

Assessment of disease similarity by means of common 
therapeutic compounds 

CTD (Table 1) [33] was introduced to compare PTCs for 
diseases (Figure 3B). CTD not only documents disease-related 
genes, but also documents disease-related markers and potential 
therapeutic compounds for diseases. Only potential therapeutic 
compounds for diseases were extracted as PTCs. In a previous 
study, disease terms of CTD were integrated with DO [30]. After
extracting PTCs for diseases from CTD, 916 diseases, 3,522 
chemicals and 11,134 associations were retained (Dataset S3). In 
addition, 44 diseases without PTCs in CTD were also kept. 

In order to illustrate the point that similar diseases can often be
treated with similar drugs [9,47-49], PTCs for the top 100 pairs of
similar diseases (T100-PSDs) and top 100 pairs of dissimilar
diseases (T100-PDDs) (Dataset S4) identified using SemFunSim
were compared. We counted the number of pairs with common
PTCs and used a hypergeometric test to calculate the P-value for
common PTCs for each pair of diseases. The P-value was adjusted
by FDR [50]. There are 419,070 pairs between these 916 diseases.
Of these, 1,251 pairs are linked to each other by an 'IS_A'
relationship of DO; they were excluded from the comparison so that
common PTCs caused by the inclusion relationship would not bias the result.
The results of the comparison are shown in Figure 5. 79 pairs of
the T100-PSDs can be treated with common PTCs and 43 pairs
have an adjusted P-value <0.05. In comparison, only 1 pair of the
T100-PDDs can be treated with common PTCs and no pair has
an adjusted P-value <0.05. The results show that the higher the
similarity of a pair of diseases, the more likely they can be treated
with common PTCs. Therefore, the SemFunSim results support the
assumption that similar diseases can often be treated with similar
drugs [9,47-49].
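
The significance test for shared PTCs can be sketched as a hypergeometric tail probability followed by a multiple-testing adjustment. The parameterization below (all chemicals as the population, the PTC counts of the two diseases as the marked items and the draw size) and the choice of the Benjamini-Hochberg procedure are assumptions, since the text does not spell them out; the PTC counts in the usage example are hypothetical, while 3,522 is the number of chemicals reported above.

```python
from math import comb


def hypergeom_tail(k, population, n_first, n_second):
    """P(X >= k), where X is the overlap between n_second items drawn
    without replacement and n_first marked items in the population."""
    upper = min(n_first, n_second)
    favorable = sum(
        comb(n_first, x) * comb(population - n_first, n_second - x)
        for x in range(k, upper + 1)
    )
    return favorable / comb(population, n_second)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Small case that can be checked by hand: (3*21 + 1*7) / 210 = 1/3.
small = hypergeom_tail(2, 10, 3, 4)

# Hypothetical disease pair: 11 and 20 PTCs, 4 in common, 3,522 chemicals.
p_pair = hypergeom_tail(4, 3522, 11, 20)

adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
print(small, p_pair < 0.05, adjusted)
```

Exact integer binomials from `math.comb` keep the tail sum free of rounding until the final division, which is convenient for a population of a few thousand chemicals.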

We further compared the PTCs for the T100-PSDs identified
by the five methods (Dataset S5). The results are shown in Figure 6.
2, 15, 29, 31, 35, and 79 pairs of the T100-PSDs identified by BOG,
PSB, Resnik, FunSim, Wang and SemFunSim, respectively, can be
treated with common PTCs, and 0, 4, 19, 17, 10, and 43 pairs of the
T100-PSDs identified by BOG, PSB, Resnik, FunSim, Wang and
SemFunSim, respectively, have an adjusted P-value <0.05. FunSim
is part of SemFunSim and is designed by considering comprehensive
gene functional association. It identifies a higher number
of pairs of diseases with common PTCs than BOG and PSB. This
shows that disease similarity calculated from comprehensive gene
functional association is well suited to exploiting the observation
that similar diseases can often be treated with similar drugs
[9,47-49]. The SemFunSim method identifies more than twice as many
pairs with common PTCs as any of the other methods. This
indicates that SemFunSim is very suitable for the task.

The same test was applied to the top 500 pairs of similar 
diseases (T500-PSDs) and the top 1000 pairs of similar diseases 
[Figure 5 image. Panel A, T100-PDDs (top 100 pairs of dissimilar diseases): without common PTCs (99%); with common PTCs and adjusted P-value ≥0.05 (1%). Panel B, T100-PSDs (top 100 pairs of similar diseases): without common PTCs (21%); with common PTCs and adjusted P-value ≥0.05 (36%); with common PTCs and adjusted P-value <0.05 (43%).]

Figure 5. The number of pairs of diseases identified using SemFunSim with common PTCs. A. The number of pairs of the T100-PDDs with
common PTCs. B. The number of pairs of the T100-PSDs with common PTCs. The yellow area represents the number of pairs without common PTCs.
The pink area indicates the number of pairs with common PTCs and adjusted P-value ≥0.05. The light blue area represents the number of pairs with
common PTCs and adjusted P-value <0.05.
doi:10.1371/journal.pone.0099415.g005



(T1000-PSDs) identified by the five methods (Dataset S5). The
results are shown in Table S1. In the table, 57, 247, 281, 308, 457,
and 556 pairs of the T1000-PSDs identified by BOG, Resnik,
Wang, PSB, FunSim, and SemFunSim, respectively, can be treated
with common PTCs, and 9, 99, 90, 104, 170, and 237 pairs of the
T1000-PSDs identified by BOG, Resnik, Wang, PSB, FunSim,
and SemFunSim, respectively, have an adjusted P-value <0.05.
The performance of Resnik, FunSim and Wang appears to be
roughly the same in the T100-PSDs. After comparing more pairs
of similar diseases (T500-PSDs and T1000-PSDs), FunSim
performs better than Resnik and Wang (Table S1). The
experimental results in Table S1 show that SemFunSim has an
advantage over the other compared methods.

Using random permutations of the functional gene network and
the 916 diseases with PTCs in CTD, as described in the Methods
section, we defined thresholds for significant similarity. We found
that 448 pairs of diseases have a similarity score above 0.06060 at
an FDR less than 0.05, and 6,981 pairs of diseases have a
similarity score above 0.00111 at an FDR less than 0.10. The
FDRs for pairs of diseases with a similarity score above 0.00111
are listed in Dataset S6. The threshold can be defined as 0.06060
(FDR <0.05). In addition, researchers can relax the
threshold to validate more disease pairs, for example to 0.00111 (FDR
<0.10).

In an early study, van Driel et al. [51] developed a tool 
(MimMiner), which was extensively used to calculate similarity 
between phenotype terms from OMIM [52]. We obtained the 
similarity score between 5,080 OMIM phenotype records from 
MimMiner (Table 1). As mentioned before, CTD includes 916 
diseases with PTCs. 127 common diseases between the 5,080 
OMIM phenotype records and these 916 diseases (Dataset S7) 
were found through DO's extensive cross mapping [21]. Then, 



SemFunSim and MimMiner were compared on the basis of these 
127 diseases. 

The result of the comparison is shown in Figure 7. 39, 129, and
218 pairs of the T100-PSDs, T500-PSDs, and T1000-PSDs
identified by MimMiner, respectively, can be treated with common
PTCs, and 17, 52, and 79 pairs of the T100-PSDs, T500-PSDs,
and T1000-PSDs, respectively, have an adjusted P-value <0.05. In
comparison, 74, 271, and 441 pairs of the T100-PSDs, T500-PSDs,
and T1000-PSDs identified by SemFunSim, respectively, can
be treated with common PTCs, and 43, 100, and 130 pairs of the
T100-PSDs, T500-PSDs, and T1000-PSDs, respectively, have an
adjusted P-value <0.05. The results show that similar diseases
identified using SemFunSim are very likely to be treated with
common drugs.

We further compared MimMiner and SemFunSim based on
their thresholds. 53 pairs of the 127 common diseases identified by
MimMiner have a similarity >0.4 (threshold of MimMiner)
(Dataset S7). 23 (43.4%, 23/53) of them can be treated with
common PTCs, and 9 (17.0%, 9/53) have an adjusted P-value
<0.05. In comparison, 107 pairs of the 127 diseases identified by
SemFunSim have a similarity >0.00111 (threshold of SemFunSim)
(Dataset S7). 78 (72.9%, 78/107) of them can be treated
with common PTCs, and 44 (41.1%, 44/107) have an adjusted
P-value <0.05. The experimental results based on these 127 diseases
show that SemFunSim's performance in measuring disease
similarity is better than MimMiner's.

Prediction of novel therapeutic applications of known 
compounds 

SemFunSim was used to find PTCs for 44 diseases without 
PTCs in CTD. First, as shown in Figure 3B, we calculated 
similarities of 40,304 pairs between these 44 diseases and 916 
diseases with PTCs in CTD (Dataset S8). In order to avoid 
Figure 6. The number of pairs of similar diseases identified using the five methods with common PTCs. Blue bar indicates the number
of pairs with common PTCs. Red bar represents the number of pairs with common PTCs and adjusted P-value <0.05.

Figure 7. The number of pairs of similar diseases identified using MimMiner and SemFunSim with common PTCs. A. The number of
pairs of the top pairs of similar diseases with common PTCs. The red bar represents the number of pairs with common PTCs measured by
SemFunSim. The blue bar indicates the number of pairs with common PTCs measured by MimMiner. B. The number of pairs of the top pairs of similar
diseases with common PTCs and adjusted P-value <0.05. The red bar represents the number of pairs with common PTCs and adjusted P-value <0.05
measured by SemFunSim. The blue bar indicates the number of pairs with common PTCs and adjusted P-value <0.05 measured by MimMiner.
Table 2. Top 20 pairs of similar diseases.

Order | Diseases with PTCs in CTD | Diseases without PTCs in CTD | Similarities
1 | Liver Cirrhosis | Hepatopulmonary syndrome | 0.03460
2 | agranulocytosis | lymphopenia | 0.01665
3 | neutropenia | lymphopenia | 0.01566
4 | macroglobulinemia | alpha 1-antitrypsin deficiency | 0.01424
5 | hepatitis | hepatopulmonary syndrome | 0.00887
6 | Wilson disease | hemochromatosis | 0.00862
7 | systemic scleroderma | polymyalgia rheumatica | 0.00717
8 | drug-induced hepatitis | hepatopulmonary syndrome | 0.00710
9 | myasthenia gravis | lambert-eaton myasthenic syndrome | 0.00644
10 | dilated cardiomyopathy | restrictive cardiomyopathy | 0.00643
11 | sarcoidosis | cryoglobulinemia | 0.00607
12 | berylliosis | asbestosis | 0.00600
13 | berylliosis | extrinsic allergic alveolitis | 0.00575
14 | intestinal disease | hepatopulmonary syndrome | 0.00564
15 | placenta disease | bacterial vaginosis | 0.00499
16 | hyperthyroidism | congenital hypothyroidism | 0.00461
17 | biliary tract disease | hepatopulmonary syndrome | 0.00454
18 | bile duct disease | hepatopulmonary syndrome | 0.00452
19 | inflammatory bowel disease | hepatopulmonary syndrome | 0.00447
20 | primary biliary cirrhosis | hepatopulmonary syndrome | 0.00421

The first column is the descending order number of similarity between diseases. The second column represents diseases with PTCs in CTD. The third column indicates
diseases without PTCs in CTD. The fourth column represents the similarities between pairs of diseases in the second and third columns.
doi:10.1371/journal.pone.0099415.t002



diseases with common PTCs caused by the inclusion relationship,
64 of the 40,304 pairs, which can be linked with each other
by an 'IS_A' relationship of DO, were not included. Each of
the remaining 40,240 pairs includes one disease without PTCs in CTD and
one disease with PTCs in CTD. The top 20 pairs of similar
diseases (T20-PSDs) (Table 2) contain 12 diseases without PTCs



in CTD. Then, we searched PubMed to find PTCs for these 12
diseases. According to the idea that similar diseases can often be
treated with similar drugs, the PTCs for one disease in a pair of
similar diseases can be used as a reference for the other without
PTCs. For example, 'systemic scleroderma' is similar to
'polymyalgia rheumatica'; 11 PTCs for the former are docu-



Table 3. Associations between PTCs and diseases retrieved from PubMed.

Order | PTCs | Diseases with PTCs in CTD | Diseases without PTCs in CTD | Similarities | PMIDs
1 | Pentoxifylline | Liver Cirrhosis | Hepatopulmonary syndrome | 0.03460 | 23002364 [57]
5 | Acetylcysteine | hepatitis | hepatopulmonary syndrome | 0.00887 | 18341514 [58]
7 | Azathioprine | systemic scleroderma | polymyalgia rheumatica | 0.00717 | 2750226 [53]
7 | Methylprednisolone | systemic scleroderma | polymyalgia rheumatica | 0.00717 | 1768166 [54]
7 | Prednisolone | systemic scleroderma | polymyalgia rheumatica | 0.00717 | 8523341 [55]
7 | Prednisone | systemic scleroderma | polymyalgia rheumatica | 0.00717 | 15466766 [56]
8 | Pentoxifylline | drug-induced hepatitis | hepatopulmonary syndrome | 0.00710 | 23002364 [57]
9 | Prednisolone | myasthenia gravis | lambert-eaton myasthenic syndrome | 0.00644 | 10555101 [59]
11 | Methylprednisolone | sarcoidosis | cryoglobulinemia | 0.00607 | 6851261 [60]
13 | Prednisone | berylliosis | extrinsic allergic alveolitis | 0.00575 | 9489437 [61]
16 | Methimazole | hyperthyroidism | congenital hypothyroidism | 0.00461 | 22672871 [62]
19 | Acetylcysteine | inflammatory bowel disease | hepatopulmonary syndrome | 0.00447 | 18341514 [58]

The first column is the descending order number of similarity between diseases. PTCs (in the second column) for diseases (in the third column) are documented in CTD.
The fourth column represents diseases without PTCs in CTD. The fifth column indicates the similarities between pairs of diseases in the third and fourth columns. The
sixth column is the PubMed IDs that record the associations between PTCs (in the second column) and diseases (in the fourth column).
doi:10.1371/journal.pone.0099415.t003
mented in CTD. We searched PubMed for associations between
these 11 PTCs and 'polymyalgia rheumatica', and found that
four of them were also PTCs for 'polymyalgia rheumatica':
azathioprine [53], methylprednisolone [54], prednisolone [55]
and prednisone [56].
Finally, 6 of these 12 diseases from the T20-PSDs can be treated
with PTCs confirmed by literature. The detailed results are
listed in Table 3, which indicates that SemFunSim is an effective
method to find PTCs for diseases.
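
The transfer of candidate compounds from a disease with known PTCs to its most similar disease without PTCs can be sketched as follows. The function name and data layout are hypothetical; the disease pairs, similarity scores and compounds in the usage example are taken from Tables 2 and 3, and the output is only a list of candidates that still requires confirmation from literature.

```python
def candidate_ptcs(similarities, known_ptcs, top_n):
    """Propose PTCs for diseases that lack them.

    similarities: list of (disease_with_ptcs, disease_without_ptcs, score).
    known_ptcs: maps a disease to the set of PTCs documented for it.
    top_n: number of highest-scoring pairs to consider.
    """
    ranked = sorted(similarities, key=lambda item: item[2], reverse=True)
    candidates = {}
    for with_ptcs, without_ptcs, _score in ranked[:top_n]:
        proposed = candidates.setdefault(without_ptcs, set())
        proposed.update(known_ptcs.get(with_ptcs, ()))
    return candidates


# Three pairs from Table 2 with compounds listed in Table 3.
pairs = [
    ("systemic scleroderma", "polymyalgia rheumatica", 0.00717),
    ("Liver Cirrhosis", "Hepatopulmonary syndrome", 0.03460),
    ("hyperthyroidism", "congenital hypothyroidism", 0.00461),
]
known = {
    "systemic scleroderma": {"Azathioprine", "Prednisone"},
    "Liver Cirrhosis": {"Pentoxifylline"},
    "hyperthyroidism": {"Methimazole"},
}

# Only the two highest-scoring pairs are used, so the third pair is skipped.
candidates = candidate_ptcs(pairs, known, top_n=2)
print(candidates)
```

Restricting the transfer to the top-ranked pairs mirrors the use of the T20-PSDs above and keeps the number of candidates to be checked manually small.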

Conclusions 

In this article, we devise an algorithm (SemFunSim) to measure 
disease similarity by integrating FunSim and SemSim effectively. 
Experimental evaluation was performed on the benchmark set and 
100 random sets from DO. The high average AUC (96.37%) 
shows that SemFunSim achieves a high true positive rate and a 
low false positive rate. 

SemFunSim is in agreement with the notion that similar 
diseases can often be treated with similar drugs [9,47-49]. 
SemFunSim not only helps to understand associations between 
diseases, but also provides an effective way to predict PTCs for 
diseases. We found associations between diseases and PTCs that 
were not documented in CTD using SemFunSim (Table 3). 